state size
Gated Slot Attention for Efficient Linear-Time Sequence Modeling
Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short in recall-intensive tasks compared to traditional Transformers and demand significant resources for training from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory-Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA). Essentially, GSA comprises a two-layer GLA linked via $\operatorname{softmax}$, utilizing context-aware memory reading and adaptive forgetting to improve memory capacity while maintaining compact recurrent state size. This design greatly enhances both training and inference efficiency through GLA's hardware-efficient training algorithm and reduced state size. Additionally, retaining the $\operatorname{softmax}$ operation is particularly beneficial in "finetuning pretrained Transformers to RNNs" (T2R) settings, reducing the need for extensive training from scratch. Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.
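The abstract's description (a bounded set of memory slots, adaptive forgetting, and a softmax read) can be illustrated with a small recurrent sketch. This is not the authors' implementation: the update rule below, the function name `gated_slot_step`, the per-slot gate `alpha`, and the random gates are all assumptions chosen to show one plausible form in which the state stays at `2 * m * d` numbers regardless of sequence length.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gated_slot_step(K, V, q, k, v, alpha):
    """One recurrent step over m memory slots (hypothetical form).

    K, V  : lists of m slot vectors of length d, updated in place
    q,k,v : query, key and value vectors of length d for the current token
    alpha : per-slot forget gates in (0, 1); assumed to be given here,
            whereas a real model would predict them from the input
    """
    m = len(K)
    for i in range(m):
        a = alpha[i]
        # Adaptive forgetting: decay the old slot content, write the new token.
        K[i] = [a * old + (1 - a) * new for old, new in zip(K[i], k)]
        V[i] = [a * old + (1 - a) * new for old, new in zip(V[i], v)]
    # Context-aware read: softmax over slot scores, then mix the value slots.
    scores = [sum(ki * qi for ki, qi in zip(K[i], q)) for i in range(m)]
    weights = softmax(scores)
    d = len(v)
    return [sum(weights[i] * V[i][j] for i in range(m)) for j in range(d)]


random.seed(0)
m, d = 4, 3  # number of slots and head dimension (illustrative values)
K = [[0.0] * d for _ in range(m)]
V = [[0.0] * d for _ in range(m)]
outputs = []
for _ in range(10):
    q = [random.uniform(-1, 1) for _ in range(d)]
    k = [random.uniform(-1, 1) for _ in range(d)]
    v = [random.uniform(0, 1) for _ in range(d)]
    alpha = [random.uniform(0.5, 0.99) for _ in range(m)]
    outputs.append(gated_slot_step(K, V, q, k, v, alpha))

# The recurrent state is the two slot matrices; its size does not depend
# on how many tokens have been processed.
state_size = sum(len(row) for row in K) + sum(len(row) for row in V)
```

In this sketch the slot count `m` sets the memory budget, so the state can be kept compact while the softmax read still selects among slots based on the current query.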
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- Education (0.92)
- Health & Medicine (0.68)
Formulating Discrete Probability Flow Through Optimal Transport
Continuous diffusion models are commonly acknowledged to display a deterministic probability flow, whereas discrete diffusion models do not. In this paper, we aim to establish the fundamental theory for the probability flow of discrete diffusion models. Specifically, we first prove that the continuous probability flow is the Monge optimal transport map under certain conditions, and also present equivalent evidence for discrete cases.
- Asia > Middle East > Jordan (0.04)
- Asia > Middle East > Israel (0.04)
- Asia > China (0.04)
StateX: Enhancing RNN Recall via Post-training State Expansion
Shen, Xingyu, Chen, Yingfa, Thai, Zhen Leng, Han, Xu, Liu, Zhiyuan, Sun, Maosong
While Transformer-based models have demonstrated remarkable language modeling performance, their high complexities result in high costs when processing long contexts. In contrast, recurrent neural networks (RNNs) such as linear attention and state space models have gained popularity due to their constant per-token complexities. However, these recurrent models struggle with tasks that require accurate recall of contextual information from long contexts, because all contextual information is compressed into a constant-size recurrent state. Previous works have shown that recall ability is positively correlated with the recurrent state size, yet directly training RNNs with larger recurrent states results in high training costs. In this paper, we introduce StateX, a training pipeline for efficiently expanding the states of pre-trained RNNs through post-training. For two popular classes of RNNs, linear attention and state space models, we design post-training architectural modifications to scale up the state size with no or negligible increase in model parameters. Experiments on models up to 1.3B parameters demonstrate that StateX efficiently enhances the recall and in-context learning ability of RNNs without incurring high post-training costs or compromising other capabilities. Recently, recurrent neural networks (RNNs) such as gated linear attention (GLA) (Yang et al., 2024) and Mamba2 (Dao & Gu, 2024) have shown promising capabilities in language modeling. These architectures have constant per-token complexity, while the more popular Transformer architecture (Vaswani et al., 2023) has per-token complexity that grows linearly with the context length.
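The contrast between a constant-size recurrent state and a cache that grows with context can be made concrete with plain, ungated linear attention. The sketch below is not StateX itself; it assumes the simplest recurrence (state `S` accumulates outer products of keys and values, with no normalization or gating), and the names `linear_attention_step`, `d_k`, `d_v`, and `T` are illustrative.

```python
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def linear_attention_step(S, q, k, v):
    """Update the d_k x d_v state in place with outer(k, v), then read with q.

    The cost per token is O(d_k * d_v), independent of how many tokens
    came before.
    """
    for i, ki in enumerate(k):
        row = S[i]
        for j, vj in enumerate(v):
            row[j] += ki * vj
    return [sum(q[i] * S[i][j] for i in range(len(q))) for j in range(len(v))]


random.seed(0)
d_k, d_v, T = 4, 3, 16  # key dim, value dim, sequence length (illustrative)
S = [[0.0] * d_v for _ in range(d_k)]
history = []  # what a Transformer-style KV cache would have to keep
for _ in range(T):
    q = [random.uniform(-1, 1) for _ in range(d_k)]
    k = [random.uniform(-1, 1) for _ in range(d_k)]
    v = [random.uniform(-1, 1) for _ in range(d_v)]
    history.append((k, v))
    o = linear_attention_step(S, q, k, v)

# Reference for the last token: unnormalized attention over the full history.
reference = [sum(dot(q, ks) * vs[j] for ks, vs in history) for j in range(d_v)]

recurrent_floats = d_k * d_v     # fixed, whatever T is
cache_floats = T * (d_k + d_v)   # grows linearly with T
```

The recurrent output matches the full-history reference up to floating-point error, yet the recurrent model stores only `d_k * d_v` numbers. That fixed budget is the compression bottleneck the abstract points to, and it is why enlarging the state is a natural lever for recall.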
Scaling Context Requires Rethinking Attention
Gelada, Carles, Buckman, Jacob, Zhang, Sean, Bach, Txus
We argue that neither transformers nor sub-quadratic architectures are well suited to training at long sequence lengths: the cost of processing the context is too expensive in the former, too inexpensive in the latter. Approaches such as sliding window attention, which reduce the cost-per-token of a transformer, impair in-context learning, and so are also unsuitable. To address these limitations, we introduce power attention, an architectural layer for linear-cost sequence modeling whose state size can be adjusted independently of parameters, unlocking the advantages of linear attention on practical domains. We develop and open-source a set of GPU kernels for efficient power attention, identifying a novel pattern of operation fusion to avoid memory and bandwidth bottlenecks. Our experiments on the in-context learning of power attention show that these models dominate both exponential attention and linear attention at long-context training.
- North America > United States (0.28)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Asia > Middle East > Saudi Arabia > Asir Province > Abha (0.04)